{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#Introduction\n",
    "\n",
    "Idea behind decision tree: split the space of the attribute variables with recursive, binary splits, with the aim of high purity for all regions.\n",
    "\n",
    "The collection of paths to all regions makes a tree.\n",
    "\n",
    "## Vocabulary\n",
    "\n",
    "* _attribute_, or attribute variable: a dimension in which a data point has a value (typically excluding the target variable)\n",
    "\n",
    "* _target_ variable: the variable whose value is to be predicted \n",
    "\n",
    "* the attributes of the i-th data point are labeled X_i.\n",
    "\n",
    "* the value of the target variable for the i-th data point is labeled y_i. \n",
    "\n",
    "* Trees that predict a quantitative target variable are called _regression trees_, and trees that predict qualitative targets are called _classification trees_.\n",
    "\n",
    "# Play \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%pylab inline "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get iris data and make a simple prediction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn import datasets\n",
    "iris = datasets.load_iris()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import random"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create training and test data sets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "iris_X = iris.data\n",
    "iris_y = iris.target\n",
    "\n",
    "r = random.randint(0,100)\n",
    "np.random.seed(r)\n",
    "idx = np.random.permutation(len(iris_X))\n",
    "\n",
    "subset = 25\n",
    "\n",
    "iris_X_train = iris_X[idx[:-subset]]  # all but the last 'subset' rows\n",
    "iris_y_train = iris_y[idx[:-subset]]\n",
    "iris_X_test  = iris_X[idx[-subset:]]  # the last 'subset' rows\n",
    "iris_y_test  = iris_y[idx[-subset:]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Train a classification tree"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn import tree\n",
    "clf = tree.DecisionTreeClassifier()\n",
    "clf = clf.fit(iris_X_train,iris_y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Print the predicted class of iris"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "clf.predict(iris_X_train)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Based on the target values in the training set, calculate the training accuracy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def accuracy(x,y):\n",
    "    output = []\n",
    "    for i,j in zip(x,y):\n",
    "        if i == j:\n",
    "            output.append(1)\n",
    "        else:\n",
    "            output.append(0)\n",
    "    return np.mean(output)\n",
    "print(\"training accuracy: {}\".format(accuracy(clf.predict(iris_X_train),iris_y_train)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "And here's the testing accuracy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(\"testing accuracy: {}\".format(accuracy(clf.predict(iris_X_test),iris_y_test)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Visualize the tree:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.externals.six import StringIO # StringIO streams data as a string to \"output file\"\n",
    "from IPython.display import Image # need Image to display inline\n",
    "\n",
    "# export the tree data as a string to a file\n",
    "dot_data = StringIO() \n",
    "tree.export_graphviz(clf, out_file=dot_data) \n",
    "\n",
    "# compatible with modern pyparsing\n",
    "import pydotplus as pydot\n",
    "# or olde-timey\n",
    "# import pydot\n",
    "graph = pydot.graph_from_dot_data(dot_data.getvalue()) \n",
    "Image(graph.create_png())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# How it works\n",
    "\n",
    "## Split definition\n",
    "\n",
    "To decide which variable is considered at each node, and what the split point is, we need a metric to minimize: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "Image(filename=\"gini.png\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "where p_mk is the proportion of training data in the m-th region that are from the k-th class.\n",
    "\n",
    "Values of p_mk close to 0 or 1 represent better purity, so we minimize G.\n",
    "\n",
    "\n",
    "## Cross validation: a side note\n",
    "\n",
    "Cross validation is a generalization of the testing/training data set paradigm. A reasonable test for the validity of a tree is to re-sample the training and testing data set, re-fitting the tree each time. Small variations in the resulting trees indicate a stable model.\n",
    "\n",
    "\n",
    "# A Problematic Example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "classifier_1 = tree.DecisionTreeClassifier()\n",
    "X = numpy.array([[0],[1],[2],[3],[4],[5],[6],[7],[8],[9],[10]])\n",
    "Y = numpy.array([0,1,2,3,4,5,6,7,8,9,10])\n",
    "classifier_1 = classifier_1.fit(X,Y)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "classifier_1.predict(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## print the tree\n",
    "\n",
    "# export the tree data as a string to a file\n",
    "dot_data = StringIO()\n",
    "tree.export_graphviz(classifier_1, out_file=dot_data) \n",
    "\n",
    "#\n",
    "import pydotplus as pydot\n",
    "graph = pydot.graph_from_dot_data(dot_data.getvalue()) \n",
    "Image(graph.create_png())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The tree shown above is overtrained. Let's limit the depth."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "classifier_2 = tree.DecisionTreeClassifier(max_depth=3)\n",
    "classifier_2 = classifier_2.fit(X,Y)\n",
    "classifier_2.predict(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "dot_data = StringIO() \n",
    "tree.export_graphviz(classifier_2, out_file=dot_data) \n",
    "\n",
    "graph = pydot.graph_from_dot_data(dot_data.getvalue()) \n",
    "Image(graph.create_png())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Take-away:\n",
    "\n",
    "* trees aren't great at predicting linear relationships between attrtibute and target variables. But standard linear regression _is_.\n",
    "\n",
    "* tree size needs to be controlled to avoid over training"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Regression Trees\n",
    "\n",
    "## Concepts\n",
    "\n",
    "* The predicted target variable is the _mean_ of all the training target variable in the region\n",
    "\n",
    "* The split betwee R_1 and R_2 minimizes the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "Image(filename=\"rss.png\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "where x_i and y_i are the attribute and target variables for the i-th training data point, and y_hat is the mean of the target variables in the region."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example\n",
    "\n",
    "Let's create an example with a noisy sine function. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Create a random dataset\n",
    "rng = np.random.RandomState(1)\n",
    "# Set the range to [0,5] and sort it numerically\n",
    "X = np.sort(5 * rng.rand(80, 1), axis=0)\n",
    "# for target, take the sine of the data, and place it in an array\n",
    "y = np.sin(X).ravel()\n",
    "# add some noise to every fifth point\n",
    "y[::5] += 3 * (0.5 - rng.rand(16))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Test a set of regression trees that have different depth limits."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# use a regression tree model\n",
    "from sklearn.tree import DecisionTreeRegressor\n",
    "\n",
    "clf_1 = DecisionTreeRegressor(max_depth=2)\n",
    "clf_2 = DecisionTreeRegressor(max_depth=5)\n",
    "clf_3 = DecisionTreeRegressor()\n",
    "clf_1.fit(X, y)\n",
    "clf_2.fit(X, y)\n",
    "clf_3.fit(X, y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# generate test data in correct range, and place each pt in its own array \n",
    "X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]\n",
    "y_1 = clf_1.predict(X_test)\n",
    "y_2 = clf_2.predict(X_test)\n",
    "y_3 = clf_3.predict(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "plt.figure()\n",
    "plt.scatter(X, y, c=\"k\", label=\"data\")\n",
    "plt.plot(X_test, y_1, c=\"g\", label=\"max_depth=2\", linewidth=2)\n",
    "plt.plot(X_test, y_2, c=\"r\", label=\"max_depth=5\", linewidth=2)\n",
    "plt.plot(X_test, y_3, c=\"b\", label=\"max_depth=inf\", linewidth=1)\n",
    "\n",
    "plt.xlabel(\"data\")\n",
    "plt.ylabel(\"target\")\n",
    "plt.title(\"Decision Tree Regression\")\n",
    "plt.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "dot_data = StringIO()\n",
    "tree.export_graphviz(clf_1, out_file=dot_data)\n",
    "tree.export_graphviz(clf_2, out_file=dot_data)\n",
    "tree.export_graphviz(clf_3, out_file=dot_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "graph = pydot.graph_from_dot_data(dot_data.getvalue()) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Visualization of tree with depth=2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "Image(graph[0].create_png())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Visualization of tree with depth=5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "Image(graph[1].create_png())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Visualization of tree with no limitation on depth."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "Image(graph[2].create_png())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Options for deal with overfitting:\n",
    "\n",
    "* maximum depth\n",
    "\n",
    "* minimum training data points per region\n",
    "\n",
    "* pruning\n",
    "\n",
    "\n",
    "# Pruning\n",
    "\n",
    "* Not implemented in scikit-learn\n",
    "\n",
    "* uses cross validation to remove nodes from the tree in such a way that one makes an optimal tradeoff between the tree's complexity and its fit to the training data.\n",
    "\n",
    "# Bagging\n",
    "\n",
    "* create an ensemble of trees, based on a subdivision of the training data\n",
    "\n",
    "* average the results of the ensemble\n",
    "\n",
    "# Random forests\n",
    "\n",
    "* deal with the fact that tree production is greedy, and always uses the strongest split first.\n",
    "\n",
    "* base each split on a random subset of attributes\n",
    "\n",
    "* combine as in bagging\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
